Corpus of Contemporary Romanian
Annotation Levels
CoRoLa and DRuKoLA

Verginica Barbu Mititelu, Research Institute for Artificial Intelligence, Romanian Academy
Dan Cristea, Institute of Computer Science, Romanian Academy – Iasi branch
Ruxandra Cosma, Faculty of Foreign Languages, University of Bucharest

CoRoLa & DRuKoLA
• CoRoLa
  – Priority project of the Romanian Academy (2014-2017)
  – partners:
    • Research Institute for Artificial Intelligence “Mihai Draganescu” – Bucharest
    • Institute for Computer Science – Iasi
  – volunteers from:
    • University of Bucharest
    • “Alexandru Ioan Cuza” University of Iasi
    • University Politehnica of Bucharest
• DRuKoLA
  – Alexander von Humboldt Foundation, Research Group Linkage Programme (2016-2018)
  – partners:
    • University of Bucharest
    • Institut für Deutsche Sprache (Mannheim)
    • Research Institute for Artificial Intelligence “Mihai Draganescu” – Bucharest
    • Institute for Computer Science – Iasi

Corpus of Contemporary Romanian Language (2014-2017)
• A huge collection of written and oral (spoken and read) texts
• Target: reference corpus
• Contemporary: since 1945; actually ~1999

• 25 billion tokens
• a broad variety of text types, with a quantitative focus on newspaper texts and a rapidly growing portion of computer-mediated communication

CoRoLa – currently being developed at the institutes in Bucharest and in Iași

1 Harmonization of DeReKo and CoRoLa
• CoRoLa and DeReKo metadata comply with the CMDI (Component MetaData Infrastructure) and/or TEI-P5 standards
• for the construction of comparable corpora:
  – syntactical interoperability
  – semantic interoperability (e.g. for the metadata categories that are used for the construction of virtual corpora)
• The general procedure for the harmonization of data categories and value sets will be to define functions that map the original data to more coarse-grained taxonomies
• additional harmonization on lower levels, e.g. for the integration of CoRoLa into the KorAP corpus query engine, or for the adoption of the GGS query mechanism developed for CoRoLa as an auxiliary search engine
  to express constraints that would exploit the multi-layered annotation of DeReKo

2 Query and analysis software (I)
The software that will be used for conducting the corpus linguistic research within DRuKoLA is the corpus query and analysis platform KorAP (recently developed at the IDS; Bański et al., 2013; 2014). Besides KorAP’s more performance-oriented features, such as horizontal scalability with respect to an unbounded corpus size and any number of annotation layers, two are particularly fundamental for DRuKoLA:
i) the ability to manage corpora that are physically located at different places, in order to comply with typical license restrictions (cf. Kupietz et al. 2014), and
ii) the ability to dynamically create virtual sub-corpora based on text properties and to manage these virtual corpora in a persistent way, e.g. to allow for reusability and reproducibility

2 Query and analysis software (II)
• features for the rather mono-linguistic research purposes will be integrated from recent and ongoing developments of the project partners
• functionalities specifically required for cross-linguistic research tasks will first be inventoried and developed during the project

Linguistically Linked Open Data – resources that help to build text processors for CoRoLa
• Thesaurus dictionary of the Romanian language in electronic form
  – eDTLR: http://edtlr.info.uaic.ro/
  – train a word sense disambiguation program
• Romanian treebank with over 10,000 sentences in 2015
  – UAIC-RoDep Treebank: http://nlptools.info.uaic.ro/Resources.jsp
  – train a syntactic parser
• A corpus of semantic relations
  – QuoVadis: http://nlptools.info.uaic.ro/Resources.jsp
  – train a program to recognise coreference, affective, kinship and social relations
• A semantically annotated treebank
  – correct syntactic roles on semantic grounds
Thanks: Cătălina Mărănduc, PhD report, Nov 2016

Continuation of CoRoLa: a diachronic corpus
• Acquisition of textual data
  – including with the help of a Cyrillic OCR (CyRo project, under evaluation)
• Infer the paradigmatic morphology of old Romanian
  – from eDTLR citations and other sources
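
Illustration (not part of the original slides): the procedure named under "Harmonization of DeReKo and CoRoLa", i.e. functions that map the original metadata values to a more coarse-grained taxonomy, can be sketched together with the selection of a virtual sub-corpus on the harmonized category. This is a minimal sketch only. All labels, table names and the `Text` record are hypothetical and do not reflect the actual CoRoLa, DeReKo or KorAP value sets or APIs.

```python
from dataclasses import dataclass

# Hypothetical fine-grained text-type labels (assumptions for illustration,
# not the real CoRoLa or DeReKo metadata values).
COROLA_TO_COARSE = {
    "stiri": "press",
    "editorial": "press",
    "roman": "fiction",
    "blog": "cmc",
}
DEREKO_TO_COARSE = {
    "Zeitung": "press",
    "Roman": "fiction",
    "Forum": "cmc",
}


@dataclass(frozen=True)
class Text:
    text_id: str
    corpus: str     # "CoRoLa" or "DeReKo"
    text_type: str  # original, corpus-specific label


def harmonize(text: Text) -> str:
    """Map a corpus-specific text type to the shared coarse-grained category."""
    table = COROLA_TO_COARSE if text.corpus == "CoRoLa" else DEREKO_TO_COARSE
    # Labels without a mapping fall back to a catch-all category.
    return table.get(text.text_type, "other")


def virtual_corpus(texts, category):
    """Return the ids of all texts whose harmonized category matches."""
    return [t.text_id for t in texts if harmonize(t) == category]


texts = [
    Text("ro-1", "CoRoLa", "stiri"),
    Text("ro-2", "CoRoLa", "roman"),
    Text("de-1", "DeReKo", "Zeitung"),
    Text("de-2", "DeReKo", "Lyrik"),
]
press = virtual_corpus(texts, "press")
```

Because both corpora are mapped onto the same coarse categories, a comparable Romanian-German sub-corpus can be defined with a single condition, while the original fine-grained labels stay untouched in the source metadata.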